% ------------------------------------------------------------------
\chapter{Computational blocks}\label{s:blocks}
% ------------------------------------------------------------------

This chapters describes the individual computational blocks supported by \vlnn. The interface of a CNN computational block \verb!<block>! is designed after the discussion in Section~\ref{s:modularity}. The block is implemented as a MATLAB function \verb!y = vl_nn<block>(x,w)! that takes as input MATLAB arrays \verb!x! and \verb!w! representing the input data and parameters and returns an array \verb!y! as output. In general, \verb!x! and \verb!y! are 4D real arrays packing $N$ maps or images, as discussed above, whereas \verb!w! may have an arbitrary shape.

The function implementing each block is capable of working in the backward direction as well, in order to compute derivatives. This is done by passing a third optional argument \verb!dzdy! representing the derivative of the output of the network with respect to $\by$; in this case, the function returns the derivatives \verb![dzdx,dzdw] = vl_nn<block>(x,w,dzdy)! with respect to the input data and parameters. The arrays \verb!dzdx!, \verb!dzdy! and \verb!dzdw! have the same dimensions of \verb!x!, \verb!y! and \verb!w! respectively (see Section~\ref{s:backward}).

Different functions may use a slightly different syntax, as needed: many functions can take additional optional arguments, specified as a property-value pairs; some do not have parameters  \verb!w! (e.g. a rectified linear unit); others can take multiple inputs and parameters, in which case there may be more than one \verb!x!, \verb!w!, \verb!dzdx!, \verb!dzdy! or \verb!dzdw!. See the rest of the chapter and MATLAB inline help for details on the syntax.\footnote{Other parts of the library will wrap these functions into objects with a perfectly uniform interface; however, the low-level functions aim at providing a straightforward and obvious interface even if this means differing slightly from block to block.}

The rest of the chapter describes the blocks implemented in \vlnn, with a particular focus on their analytical definition. Refer instead to MATLAB inline help for further details on the syntax.

% ------------------------------------------------------------------
\section{Convolution}\label{s:convolution}
% ------------------------------------------------------------------

The convolutional block is implemented by the function \verb!vl_nnconv!. \verb!y=vl_nnconv(x,f,b)! computes the convolution of the input map $\bx$ with a bank of $K$ multi-dimensional filters $\bff$ and biases $b$. Here
\[
 \bx\in\real^{H \times W \times D}, \quad
 \bff\in\real^{H' \times W' \times D \times D''}, \quad
 \by\in\real^{H'' \times W'' \times D''}.
\]
Formally, the output  is given by
\[
y_{i''j''d''}
=
b_{d''}
+
\sum_{i'=1}^{H'}
\sum_{j'=1}^{W'}
\sum_{d'=1}^D
f_{i'j'd} \times x_{i''+i'-1,j''+j'-1,d',d''}.
\]
The call \verb!vl_nnconv(x,f,[])! does not use the biases. Note that the function works with arbitrarily sized inputs and filters (as opposed to, for example, square images). See Section~\ref{s:impl-convolution} for technical details.

\paragraph{Padding and stride.} \verb!vl_nnconv! allows to specify  top-bottom-left-right paddings $(P_h^-,P_h^+,P_w^-,P_w^+)$ of the input array and subsampling strides $(S_h,S_w)$ of the output array:
\[
y_{i''j''d''}
=
b_{d''}
+
\sum_{i'=1}^{H'}
\sum_{j'=1}^{W'}
\sum_{d'=1}^D
f_{i'j'd} \times x_{S_h (i''-1)+i'-P_h^-, S_w(j''-1)+j' - P_w^-,d',d''}.
\]
In this expression, the array $\bx$ is implicitly extended with zeros as needed.

\paragraph{Output size.} \verb!vl_nnconv! computes only the ``valid'' part of the convolution; i.e. it requires each application of a filter to be fully contained in the input support. The height of the output array (the width following the same rule) is then
\[
  H'' = H - H' + 1.
\]
When padding is used, the input array is virtually extended with zeros, so the height of the output array is larger:
\[
  H'' = H - H' + 1 + P_h^- + P_h^+.
\]
When downsampling is used as well, this becomes
\[
  H'' = 1 + \left\lfloor \frac{H - H' + P_h^- + P_h^+}{S_h} \right\rfloor.
\]
In general, the padded input must be at least as large as the filters: $H +P_h^- + P_h^+ \geq H'$.

\paragraph{Receptive field size and geometric transformations.} Very often it is useful to relate geometrically the indexes of the various array to the input data (usually images) in term of coordiante transformations and size of the receptive field (i.e. of the image region that affects an output). This is derived later, in Section~\ref{s:receptive}.

\paragraph{Fully connected layers.} In other libraries, a \emph{fully connected blocks or layers} are linear functions where each output dimension depends on all the input dimensions. \vlnn does not distinguishes between fully connected layers and convolutional blocks. Instead, the former is a special case of the latter obtained when the output map $\by$ has dimensions $W''=H''=1$. Internally, \verb!vl_nnconv! handle this case more efficiently when possible.

\paragraph{Filter groups.} For additional flexibility, \verb!vl_nnconv! allows to group channels of the input array $\bx$ and apply to each group different subsets of filters. To use this feature, specify as input a bank  of $D''$ filters $\bff\in\real^{H'\times W'\times D'\times D''}$ such that $D'$ divides the number of input dimensions $D$. These are treated as $g=D/D'$ filter groups; the first group is applied to dimensions $d=1,\dots,D'$ of the input $\bx$; the second group to dimensions $d=D'+1,\dots,2D'$ and so on. Note that the output is still an array $\by\in\real^{H''\times W''\times D''}$.

An application of grouping is implementing the Krizhevsky and Hinton network~\cite{krizhevsky12imagenet} which uses two such streams. Another application is sum pooling; in the latter case, one can specify $D$ groups of $D'=1$ dimensional filters identical filters of value 1 (however, this is considerably slower than calling the dedicated pooling function as given in Section~\ref{s:pooling}).

% ------------------------------------------------------------------
\section{Convolution transpose (deconvolution)}\label{s:convt}
% ------------------------------------------------------------------

The \emph{convolution transpose} block (some authors call it improperly deconvolution) implements an operator this is the transpose of the convolution described in Section~\ref{s:convolution}. This is  implemented by thefunction \verb!vl_nnconvt!. Let:
\[
 \bx\in\real^{H \times W \times D}, \quad
 \bff\in\real^{H' \times W' \times D \times D''}, \quad
 \by\in\real^{H'' \times W'' \times D''}, \quad
\]
be the input tensor, filters, and output tensor. Imagine operating in the reverse direction and to use the filter bank $\bff$ to convolve $\by$ to obtain $\bx$ as in Section~\ref{s:convolution}; since this operation is linear, it can be expressed as a matrix $M$ such that  $\vv \bx = M \vv\by$; convolution transpose computes instead $\vv \by = M^\top \vv \bx$.

Convolution transpose is mainly useful in two settings. The first one are the so called \emph{deconvolutional networks}~\cite{zeiler14visualizing} and other networks such as convolutional decoders that use the transpose of convolution. The second one is implementing data interpolation. In fact, as the convolution block supports input padding and output downsampling, the convolution transpose block supports input upsampling and output cropping. 

Convolution transpose can be expressed in closed form in the following rather unwieldy expression (derived in Section~\ref{s:impl-convolution-transpose}):
\begin{multline*}
y_{i''j''d''} = 
 \sum_{d'=1}^{D}
 \sum_{i'=0}^{q(H',S_h)}
 \sum_{j'=0}^{q(W',S_w)}
f_{
1+ S_hi' + m(i''+ P_h^-, S_h),\ %
1+ S_wj' + m(j''+ P_w^-, S_w),\ %
d'',
d'
} 
\times \\
x_{
1 - i' + q(i''+P_h^-,S_h),\ %
1 - j' + q(j''+P_w^-,S_w),\ %
d'
}
\end{multline*}
where
\[
m(k,S) = (k - 1) \bmod S,
\qquad
q(k,n) = \left\lfloor \frac{k-1}{S} \right\rfloor,
\]
$(S_h,S_w)$ are the vertical and horizontal \emph{input upsampling factors},  $(P_h^-,P_h^+,P_h^-,P_h^+)$ the \emph{output crops}, and $\bx$ and $\bff$ are zero-padded as needed in the calculation. Furthermore, given the input and filter heights, the output height is given by
\[
  H'' = S_h (H - 1) + H' -P^-_h - P^+_h.
\]
The same is true for the width.

To gain some intuition on this expression, consider only the vertical direction, assume that the crop parameters are zero, and that $S_h>1$. Consider the output sample $y_{i''}$ where $S_h$ divides $i''-1$; this sample is computed as a weighted summation of $x_{i'' / S_h},x_{i''/S_h-1},...$ in reverse order. The weights come from the filter elements $f_1$, $f_{S_h}$,$f_{2S_h},\dots$ with a step of $S_h$. Now consider computing the element $y_{i''+1}$; due to the rounding in the quotient operation $q(i'',S_h)$, this output is a weighted combination of the same elements of the input $x$ as for $y_{i''}$; however, the filter weights are now shifted by one place to the right: $f_2$, $f_{S_h+1}$,$f_{2S_h+1}$, $\dots$. The same is true for $i''+2, i'' + 3,\dots$ until we hit $i'' + S_h$. Here the cycle restarts after shifting $\bx$ to the right by one place. Effectively, convolution transpose works as an \emph{interpolating filter}.

Note also that filter $k$ is stored as a slice $\bff_{:,:,k,:}$ of the 4D tensor $\bff$.

% ------------------------------------------------------------------
\section{Spatial pooling}\label{s:pooling}
% ------------------------------------------------------------------

\verb!vl_nnpool! implements max and sum pooling. The \emph{max pooling} operator computes the maximum response of each feature channel in a $H' \times W'$ patch
\[
y_{i''j''d} = \max_{1\leq i' \leq H', 1 \leq j' \leq W'} x_{i''+i-1',j''+j'-1,d}.
\]
resulting in an output of size $\by\in\real^{H''\times W'' \times D}$, similar to the convolution operator of Section~\ref{s:convolution}. Sum-pooling computes the average of the values instead:
\[
y_{i''j''d} = \frac{1}{W'H'}
\sum_{1\leq i' \leq H', 1 \leq j' \leq W'} x_{i''+i'-1,j''+j'-1,d}.
\]
Detailed calculation of the derivatives are provided in Section~\ref{s:impl-pooling}.

\paragraph{Padding and stride.} Similar to the convolution operator of Sect.~\ref{s:convolution}, \verb!vl_nnpool! supports padding the input; however, the effect is different from padding in the convolutional block as pooling regions straddling the image boundaries are cropped. For max pooling, this is equivalent to extending the input data with $-\infty$; for sum pooling, this is similar to padding with zeros, but the normalization factor at the boundaries is smaller to account for the smaller integration area.

% ------------------------------------------------------------------
\section{Activation functions}\label{s:activation}
% ------------------------------------------------------------------

\vlnn supports the following activation functions:
%
\begin{itemize}
\item \emph{ReLU.} \verb!vl_nnrelu! computes the \emph{Rectified Linear Unit} (ReLU):
\[
 y_{ijd} = \max\{0, x_{ijd}\}.
\]

\item \emph{Sigmoid.} \verb!vl_nnsigmoid! computes the \emph{sigmoid}:
\[
 y_{ijd} = \sigma(x_{ijd}) = \frac{1}{1+e^{-x_{ijd}}}.
\]
\end{itemize}
%
See Section~\ref{s:impl-activation} for implementation details.

% ------------------------------------------------------------------
\section{Normalization}\label{s:normalization}
% ------------------------------------------------------------------

% ------------------------------------------------------------------
\subsection{Cross-channel normalization}\label{s:ccnormalization}
% ------------------------------------------------------------------

\verb!vl_nnnormalize! implements a cross-channel normalization operator. Normalization applied independently at each spatial location and groups of channels to get:
\[
 y_{ijk} = x_{ijk} \left( \kappa + \alpha \sum_{t\in G(k)} x_{ijt}^2 \right)^{-\beta},
\]
where, for each output channel $k$, $G(k) \subset \{1, 2, \dots, D\}$ is a corresponding subset of input channels. Note that input $\bx$ and output $\by$ have the same dimensions. Note also that the operator is applied across feature channels in a convolutional manner at all spatial locations.

See Section~\ref{s:impl-ccnormalization} for implementation details.

% ------------------------------------------------------------------
\subsection{Batch normalization}\label{s:bnorm}
% ------------------------------------------------------------------

\verb!vl_nnbnorm! implements batch normalization~\cite{ioffe2015}. Batch normalization is somewhat different from other neural network blocks in that it performs computation across images/feature maps in a batch (whereas most blocks process different images/feature maps individually). \verb!y = vl_nnbnorm(x, w, b)! normalizes each channel of the feature map $\mathbf{x}$ averaging over spatial locations and batch instances. Let $T$ the batch size; then
\[
\mathbf{x}, \mathbf{y} \in \mathbb{R}^{H \times W \times K \times T},
\qquad\mathbf{w} \in \mathbb{R}^{K},
\qquad\mathbf{b} \in \mathbb{R}^{K}.
\]
Note that in this case the input and output arrays are explicitly treated as 4D tensors in order to work with a batch of feature maps. The tensors  $\mathbf{w}$ and $\mathbf{b}$ define component-wise multiplicative and additive constants. The output feature map is given by
\[
y_{ijkt} = w_k \frac{x_{ijkt} - \mu_{k}}{\sqrt{\sigma_k^2 + \epsilon}} + b_k,
\quad
\mu_{k} = \frac{1}{HWT}\sum_{i=1}^H \sum_{j=1}^W \sum_{t=1}^{T} x_{ijkt},
\quad
\sigma^2_{k} = \frac{1}{HWT}\sum_{i=1}^H \sum_{j=1}^W \sum_{t=1}^{T} (x_{ijkt} - \mu_{k})^2.
\]
See Section~\ref{s:impl-bnorm} for implementation details.

% ------------------------------------------------------------------
\subsection{Spatial normalization}\label{s:spnorm}
% ------------------------------------------------------------------

\verb!vl_nnspnorm! implements spatial normalization. Spatial normalization operator acts on different feature channels independently and rescales each input feature by the energy of the features in a local neighborhood . First, the energy of the features is evaluated in a neighbourhood $W'\times H'$
\[
n_{i''j''d}^2 = \frac{1}{W'H'}
\sum_{1\leq i' \leq H', 1 \leq j' \leq W'} x^2_{
i''+i'-1-\lfloor \frac{H'-1}{2}\rfloor,
j''+j'-1-\lfloor \frac{W'-1}{2}\rfloor,
d}.
\]
In practice, the factor $1/W'H'$ is adjusted at the boundaries to account for the fact that neighbors must be cropped. Then this is used to normalize the input:
\[
y_{i''j''d} = \frac{1}{(1 + \alpha n_{i''j''d}^2)^\beta} x_{i''j''d}.
\]
See Section\ref{s:spnorm} for implementation details.

% ------------------------------------------------------------------
\subsection{Softmax}\label{s:softmax}
% ------------------------------------------------------------------

\verb!vl_nnsoftmax! computes the softmax operator:
\[
 y_{ijk} = \frac{e^{x_{ijk}}}{\sum_{t=1}^D e^{x_{ijt}}}.
\]
Note that the operator is applied across feature channels and in a convolutional manner at all spatial locations. Softmax can be seen as the combination of an activation function (exponential) and a normalization operator. See Section~\ref{s:impl-softmax} for implementation details.

% ------------------------------------------------------------------
\section{Losses and comparisons}\label{s:losses}
% ------------------------------------------------------------------

% ------------------------------------------------------------------
\subsection{Log-loss}\label{s:loss}
% ------------------------------------------------------------------

\verb!vl_logloss! computes the \emph{logarithmic loss}
\[
 y = \ell(\bx,c) = - \sum_{ij} \log x_{ijc}
\]
where $c \in \{1,2,\dots,D\}$ is the ground-truth class. Note that the operator is applied across input channels in a convolutional manner, summing the loss computed at each spatial location into a single scalar.  See Section~\ref{s:impl-loss} for implementation details.

% ------------------------------------------------------------------
\subsection{Softmax log-loss}\label{s:sfloss}
% ------------------------------------------------------------------

\verb!vl_softmaxloss! combines the softmax layer and the log-loss into one step for improved numerical stability. It computes
\[
y = - \sum_{ij} \left(
x_{ijc} - \log \sum_{d=1}^D e^{x_{ijd}}
\right)
\]
where $c$ is the ground-truth class. See Section~\ref{s:impl-sfloss} for implementation details.

% ------------------------------------------------------------------
\subsection{$p$-distance}\label{s:pdistance}
% ------------------------------------------------------------------

The \verb!vl_nnpdist! function computes the $p$-th power $p$-distance between the vectors in the input data $\bx$ and a target $\bar\bx$:
\[
  y_{ij} = \left(\sum_d |x_{ijd} - \bar x_{ijd}|^p\right)^\frac{1}{p}
\]
Note that this operator is applied convolutionally, i.e. at each spatial location $ij$ one extracts and compares vectors $x_{ij:}$. By specifying the option \verb!'noRoot', true! it is possible to compute a variant omitting the root:
\[
  y_{ij} = \sum_d |x_{ijd} - \bar x_{ijd}|^p, \qquad p > 0.
\]
See Section~\ref{s:impl-pdistance} for implementation details.

%% ------------------------------------------------------------------
%\subsection{Product}\label{s:product}
%% ------------------------------------------------------------------
%
%\[
% y_{ijd} = x^{(1)}_{ijd} x^{(2)}_{ijd}
%\]
%
%\paragraph{Implementation details.}
%\[
% \frac{dz}{dx^{(1)}_{ijd}}
%  =
% \sum_{i''j''d''}
%  \frac{dz}{dy_{i''j''d''}} 
%  \frac{dy_{i''j''d''}}{dx^{(1)}_{ijd}}
%  =
%  \frac{dz}{dy_{ijd''}} 
%  x^{(2)}_{ijd},
%  \qquad
%  \frac{dz}{dx^{(2)}_{ijd}}
%   =
%  \frac{dz}{dy_{ijd}} 
%  x^{(1)}_{ijd}.
%\]
%
%
%% ------------------------------------------------------------------
%\subsection{Split}\label{s:split}
%% ------------------------------------------------------------------
%
%\[
% y_{ijd}^{(1)} = x_{ijd}, \qquad y_{ijd}^{(2)} = x_{ijd}
%\]
%
%\[
% \frac{dz}{dx_{ijd}} =
%\sum_{i''j''d''} 
% \frac{dz}{dy_{i''j''d''}^{(1)}}
%  \frac{dy_{i''j''d''}^{(1)}}{dx_{ijd}}
% +
%  \frac{dz}{dy_{i''j''d''}^{(2)}}
%  \frac{dy_{i''j''d''}^{(2)}}{dx_{ijd}}
%\]
